NAR Genomics and Bioinformatics
Oxford University Press (OUP)
Preprints posted in the last 90 days, ranked by how well they match the content profile of NAR Genomics and Bioinformatics, based on 214 papers previously published here. The average preprint has a 0.11% match score for this journal, so anything above that is an above-average fit.
Santos, O. J.; Dalmolin, R. J.; de Almeida, R. M. C.
Single-cell RNA sequencing (single-cell RNA-seq) has revolutionized gene expression analysis. However, high dropout rates and stochastic noise often reduce the amount of information captured in these experiments. The epithelial-mesenchymal transition (EMT), which is fundamental to tumor progression and organismal development, is particularly difficult to fully characterize due to the existence of intermediate states. In this work, we demonstrate that projecting transcriptomic data onto gene lists ordered using protein-protein interaction (PPI) information acts as a "biological low-pass filter", attenuating technical noise and increasing the statistical power of the analyses. We propose and validate an innovative pipeline that integrates the Transcriptogram method with Principal Component Analysis (PCA). By applying a moving average over functionally ordered genes, we drastically increase the signal-to-noise ratio, enabling the inference of cellular trajectories. The method was applied to a public dataset of TGF-β1-induced MCF10A cells, with rigorous batch-effect correction based on biological controls. The results reveal that EMT is not merely a morphological change, but a coordinated, systemic reprogramming. This approach enabled the identification of critical modules that would remain hidden in conventional analyses: (i) a massive "Metabolic Switch" (Cluster 2), indicating a transition toward oxidative phosphorylation to sustain invasion; (ii) a strategic blockade of the cell cycle (Cluster 4); and (iii) a "Detoxification Shield" and chemoresistance program (Cluster 5), characterized by endogenous activation of metallothioneins. We conclude that the combination of PPI network topology and dimensionality reduction offers superior resolution for dissecting cellular plasticity.
The method not only validates classical markers, but also reveals the hidden functional architecture of the transition, showing that EMT is not a single, uniform process, but rather one in which cells can follow distinct trajectories, halting at different stages of differentiation.
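The core smoothing step this abstract describes, a moving average over genes placed in a PPI-derived order, can be sketched as follows. This is a minimal illustration only: the `transcriptogram_smooth` name, the toy gene ordering, and the window width are assumptions, not the authors' implementation.

```python
import numpy as np

def transcriptogram_smooth(expr, gene_order, window=25):
    """Moving average of expression values along a PPI-ordered gene list.

    expr:       dict mapping gene -> expression value (one cell/sample)
    gene_order: genes in a PPI-derived functional order (assumed given)
    window:     odd moving-average width (illustrative default)
    """
    x = np.array([expr.get(g, 0.0) for g in gene_order], dtype=float)
    kernel = np.ones(window) / window
    # mode="same" keeps one smoothed value per ordered gene position
    return np.convolve(x, kernel, mode="same")

# toy data: 100 genes with a repeating expression pattern
expr = {f"g{i}": float(i % 5) for i in range(100)}
order = [f"g{i}" for i in range(100)]
smoothed = transcriptogram_smooth(expr, order, window=5)
```

Smoothing noisy per-gene values this way is what lets downstream PCA pick up trajectory structure rather than dropout noise.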
Davinack, A. A.
Haplotype networks are widely used in population genetics and phylogeography to visualize genealogical relationships among DNA sequences and to infer population structure, historical connectivity, and demographic processes. Existing software for haplotype network construction relies primarily on interactive graphical interfaces, which limits reproducibility, automation, and integration into modern bioinformatic workflows. Here, I introduce HapNet, an open-source Python package that enables automated construction, visualization, and summarization of haplotype networks directly from aligned FASTA files. HapNet is the first Python-native package designed specifically for automated, population-aware haplotype network construction and visualization from aligned FASTA files. HapNet implements a minimum-spanning-tree approach based on Hamming distances among haplotypes and incorporates population metadata encoded in sequence headers to produce population-aware network visualizations in which shared haplotypes are represented as pie charts and node sizes scale with haplotype frequency. In addition to a publication-ready network, HapNet generates machine-readable tabular output describing haplotype composition, population membership, and shared versus private haplotypes, facilitating downstream statistical analysis and reproducibility. Here, HapNet's utility is demonstrated using mitochondrial DNA sequences from the shell-boring polychaete worm Polydora neocaeca, illustrating how the software reveals patterns of population connectivity and haplotype sharing. HapNet provides a reproducible, scriptable alternative to existing graphical tools and is freely available via the Python Package Index and GitHub.
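The minimum-spanning-tree step HapNet describes, over Hamming distances among haplotypes, can be sketched with a simple Prim's algorithm. The function names and three-sequence example below are invented for illustration and are not HapNet's API.

```python
def hamming(a, b):
    """Number of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def mst_edges(haplotypes):
    """Minimum spanning tree over haplotypes (Prim's algorithm),
    edges weighted by Hamming distance; returns (i, j, dist) tuples."""
    n = len(haplotypes)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # cheapest edge crossing from the tree to an unattached haplotype
        i, j, d = min(
            ((i, j, hamming(haplotypes[i], haplotypes[j]))
             for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: e[2],
        )
        in_tree.add(j)
        edges.append((i, j, d))
    return edges

haps = ["ACGT", "ACGA", "TCGA"]   # three haplotypes, one mutation apart each
edges = mst_edges(haps)
```

A real implementation would additionally attach population metadata to each node for the pie-chart rendering the abstract mentions.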
He, Z.; Florea, L.
Recent foundation and deep learning models have brought a generational leap in improving the quality of genome annotation, particularly in identifying genes and their structural elements, including exons and splice sites. However, they are trained on reduced datasets that may not capture biological complexity, such as differences between coding versus non-coding, terminal versus internal, constitutive versus alternatively spliced, and transposable element (TE)-derived exons. We evaluate several foundation models for gene and splice site annotation, including the transformer-based SegmentNT, Enformer and Borzoi, coupled with a segmentation head for per-base resolution, and the CNN-based SpliceAI and AlphaGenome, along with a newly developed fine-tuned model, STEP2h, on different classes of gene elements as described above. We found that the performance of all methods is highest for the classes of exons represented in their training data and decreases drastically for poorly represented classes. In particular, performance is highest for protein-coding genes, coding exons, and constitutive exons, and decreases by up to 2-4 fold for non-coding internal exons, terminal exons, and exons that undergo alternative splicing. Similarly, performance is impaired on LINE-1 and Alu-derived exons. In contrast, a locally developed CNN model fine-tuned on a specialized TE-exon dataset showed improved performance in this category. Our study highlights the outstanding challenges in gene and exon annotation when leveraging powerful foundation models, and the need for further fine-tuning on judiciously selected classes of data or task-specific models to capture a broader, more diverse spectrum of gene features.
Dobramysl, U.; Wheeler, R. J.
Trypanosomatid parasites, including human infective Leishmania and Trypanosoma species, have an unusual genome organisation and transcription. They are unicellular eukaryotes, but unlike most eukaryotes, which have individual promoters per gene, most protein coding genes are co-transcribed in long gene arrays. This nascent transcript is processed into individual mRNAs by trans-splicing and polyadenylation. Accurate analysis of transcription, transcript processing and transcript abundance requires accurate genome annotation of spliced leader acceptor sites, polyadenylation sites and the resulting 5' and 3' mRNA untranslated regions. Here, we describe tools for annotating these features from short read RNA sequencing data and for measuring the usage of spliced leader acceptor and polyadenylation sites. These are practical, scalable software packages, and we use them to annotate UTRs across all available trypanosomatid genomes.
Shintani, M.; Andrade, D.; Bono, H.
Although the Gene Expression Omnibus and other public repositories are expanding rapidly, curation across these databases has not kept pace. Data reuse is often hindered by unstandardized metadata comprising unstructured text. To address this, we developed a workflow that combines retrieval via an application programming interface with semantic filtering using large language models (LLMs) for automated curation. We benchmarked multiple LLMs using metadata from 150 candidate Arabidopsis RNA sequencing projects to classify samples treated with exogenous abscisic acid and their controls. Simple keyword searches yielded many false positives (F1=0.59); classification using LLMs significantly improved performance. Several open-weight models achieved a nearly perfect performance (F1>0.98), comparable to that of closed models. We also found that utilizing LLM confidence scores enables high-confidence cases to be processed automatically. These results suggest that open-weight LLMs can support scalable and reproducible metadata curation in local environments, providing a foundation for accelerating public dataset reuse.
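The confidence-score routing described above, where high-confidence LLM classifications are processed automatically and the rest are queued for review, might look like the sketch below. `classify` is a stand-in for whatever LLM call the workflow actually makes, and the threshold value is illustrative.

```python
def route_by_confidence(records, classify, threshold=0.9):
    """Split LLM classifications into an auto-accepted queue and a
    manual-review queue, based on the model's reported confidence."""
    auto, review = [], []
    for rec in records:
        label, conf = classify(rec)
        (auto if conf >= threshold else review).append((rec, label, conf))
    return auto, review

def fake_classify(rec):
    # toy stand-in for an LLM call returning (label, confidence)
    if "abscisic" in rec:
        return "ABA-treated", 0.95
    return "control", 0.60

auto, review = route_by_confidence(
    ["abscisic acid 10 uM, 3 h", "mock treatment"], fake_classify)
```

The interesting engineering question the paper raises is where to set `threshold` so that the auto-accepted queue stays near the F1>0.98 regime reported for the best open-weight models.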
WANG, Z.; Arsuaga, J.
Computational bacteriophage host prediction from genomic sequences remains challenging because host range depends on diverse, rapidly evolving genomic determinants--from receptor-binding proteins to anti-defense systems and downstream infection compatibility--and because the signals available to predictors, including sequence homology, CRISPR spacer matches, nucleotide composition, and mobile genetic elements, are sparse, unevenly distributed across taxa, and constrained by incomplete host annotations. Here, we frame host prediction as an unsupervised retrieval problem. We asked whether embeddings from the pretrained genome language model Evo2 captured a reliable host-range signal without training on phage-host labels. We generated whole-genome embeddings for phages and candidate bacterial hosts with the Evo2-7B model, applied normalization, and ranked hosts by cosine similarity. Using the Virus-Host Database, we selected embedding and fusion choices on a Gram-positive validation cohort and then evaluated the approach on a held-out Gram-negative test cohort to minimize data leakage. We found that Evo2 was strongest at retrieving multiple plausible hosts, with the recorded host in the top 10 for 55.4% of phages. However, it did not maximize species-level top-1 accuracy (19.4% vs. 23.2% for the best baseline). At higher taxonomic ranks, Evo2 captured a coarser host-range signal: top-1 accuracy reached 43.4% at the genus level and 51.6% at the family level. Reciprocal rank fusion of Evo2 with BLASTN, VirHostMatcher, and PHIST improved all retrieval metrics. Top-10 retrieval rose to 58.5% and top-1 accuracy to 26.9%. Stratified analyses by phage genome length, host clade, and host mobile genetic element coverage revealed scenario-dependent performance. Evo2 embeddings excelled for intermediate-length phages and when host mobile element content was low, whereas alignment and k-mer methods dominated when local homology was abundant. 
These results suggest that pretrained genome embeddings complement established alignment- and k-mer/composition-based methods and that context-aware hybrid pipelines may help improve phage host prediction. Author summary: Bacteriophages are viruses that prey on bacteria and play central roles in microbial ecosystems, nutrient cycling, and the spread of antibiotic resistance genes. Knowing which bacterium a phage can infect is important for applications such as phage therapy, where viruses are used to treat bacterial infections, but making this prediction from DNA sequence data alone remains difficult. Existing computational tools each exploit different types of genomic evidence, and none works reliably across all settings. We asked whether an artificial intelligence model trained to read raw DNA--without ever being shown which phages infect which hosts--could contribute a new, complementary signal. We found that this approach was particularly effective at narrowing the field to a short list of candidate hosts and at capturing broad evolutionary relationships between phages and bacteria. When we combined it with established sequence-comparison tools, overall prediction improved beyond what any single method achieved alone. By examining when each method succeeded or failed, we identified biological factors that govern prediction difficulty, offering practical guidance for building more robust prediction systems.
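Reciprocal rank fusion, the combination step the abstract reports, assigns each candidate host the score sum over methods of 1/(k + rank). A minimal sketch follows; the host names and the conventional k=60 default are illustrative, not taken from the paper.

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: score(h) = sum_m 1 / (k + rank_m(h))."""
    scores = {}
    for ranking in rankings:              # each list is best-first
        for rank, host in enumerate(ranking, start=1):
            scores[host] = scores.get(host, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# toy per-method rankings standing in for Evo2, BLASTN, and PHIST output
evo2   = ["E. coli", "Salmonella"]
blastn = ["Salmonella", "E. coli"]
phist  = ["E. coli", "Klebsiella"]
fused = rrf([evo2, blastn, phist])
```

Because the score only uses ranks, RRF needs no calibration between the heterogeneous similarity scales of embedding, alignment, and k-mer methods, which is presumably why it suits this hybrid setting.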
Pose-Lagoa, I.; Urda-Garcia, B.; Olvera, N.; Sanchez-Valle, J.; Faner, R.; Valencia, A.; Carbonell-Caballero, J.
Complex and clinically heterogeneous diseases pose significant challenges for gene prioritisation and patient stratification, as relevant genes often show weak or context-specific signals and transcriptomic datasets are limited in size. These limitations hinder the discovery of robust molecular signatures using traditional case-control approaches and motivate computational pipelines capable of capturing molecular diversity. Here, we present an explainable ensemble-based AI pipeline to prioritise disease-relevant genes from transcriptomic data, using Chronic Obstructive Pulmonary Disease (COPD) as a use case. To retain biologically relevant interactors obscured by molecular heterogeneity, the framework integrates data-driven signals with curated COPD-related gene sets, further expanded through network-based prioritisation and supported by molecular interactions. Gene relevance is evaluated via aggregated explainability scores across multiple classifier configurations to ensure robust candidate selection. The final set comprised <8% of evaluated genes, ~62% arising from network-based expansion, substantially reducing dimensionality while preserving biological heterogeneity. Beyond case-control classification, the approach identified candidate genes and molecular subgroups associated with specific clinical features, capturing patient-level heterogeneity. The prioritised genes recapitulated key disease-related processes, including immune responses and extracellular matrix degradation, and highlighted additional associations like the enrichment of the IL-4 and IL-13 signalling pathway, which is of clinical interest given ongoing biologic developments targeting these axes. Our pipeline outperformed existing methods in discriminating COPD from controls, and the final gene list was validated in independent cohorts.
Implemented as a scalable and reusable R package, this framework facilitates the study of molecular heterogeneity in complex diseases like COPD, supporting advances in diagnosis and precision medicine. Availability and implementation: EBEx code and tutorials can be found at https://iposelag.github.io/EBEx/
Warr, M. J.; Dinh, T.; Root, B.; Onstott, E.; Yu, K.; Mudge, J.; Ramaraj, T.; Kahanda, I.; Mumey, B.
In this work, we investigate using motif subsequence features to predict whether a genomic region is accessible to regulatory proteins, i.e. an accessible chromatin region (ACR), enabling transcription of associated genes. We focus on plants, whose agricultural and ecological importance make them interesting and important organisms to study, and whose complex genomes provide important stress tests for our algorithm. We show that motif sequence similarity as found by co-linear chaining can be used in combination with machine learning models to effectively predict ACRs in genome assemblies.
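The co-linear chaining the abstract relies on can be illustrated with a simple O(n^2) dynamic program: choose a maximum-weight set of motif-match anchors that increases in both query and target coordinates. This is a generic sketch of the technique, not the authors' implementation, and the anchor tuples are invented.

```python
def colinear_chain(anchors):
    """Maximum-weight co-linear chain over (qpos, tpos, weight) anchors.
    A chain must strictly increase in both coordinates. O(n^2) DP."""
    anchors = sorted(anchors)
    best = [w for _, _, w in anchors]     # best chain score ending at anchor i
    for i, (qi, ti, wi) in enumerate(anchors):
        for j in range(i):
            qj, tj, _ = anchors[j]
            if qj < qi and tj < ti:       # anchor j can precede anchor i
                best[i] = max(best[i], best[j] + wi)
    return max(best, default=0)

# the (4,1) -> (5,6) chain (weight 7) beats the (1,2) -> (5,6) chain (weight 5)
score = colinear_chain([(1, 2, 3), (4, 1, 5), (5, 6, 2)])
```

The resulting chain scores could then serve as similarity features for the machine learning models the abstract pairs with them.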
Karapliafis, D.; Neri, U.; Olendraite, I.; Charon, J.; Sakaguchi, S.; Hou, X.; de Ridder, D.; Zwart, M. P.; Kupczok, A.
Recent advances in metatranscriptomics and large-scale mining of publicly available sequencing datasets have substantially expanded our knowledge of RNA virus diversity. Most genome mining approaches for detecting RNA viruses that encode RNA-dependent RNA polymerase (RdRp) rely on identifying this conserved protein, which is essential for the replication of RNA virus genomes. These approaches employ evolutionarily informed profile Hidden Markov Models (pHMMs) to scan large sequencing datasets for RdRp sequences. Recently, several new pHMM databases for RdRp detection have been released, each with distinct design principles, making it unclear which database is best for specific applications. Furthermore, these resources may be inaccessible to users without specialized computational expertise. Here we introduce the RdRp Collaborative Analysis Tool with Collections of pHMMs (RdRpCATCH: https://github.com/dimitris-karapliafis/RdRpCATCH), developed to consolidate publicly available RdRp pHMM resources into a single, accessible platform. RdRpCATCH enables the scanning of (meta)transcriptomic assemblies to discover RNA viruses and provides subsequent taxonomic annotation of detected contigs. A comparative analysis of RdRp pHMM databases reveals that most are highly effective at detecting known diversity of RNA viruses while minimizing false positives, supporting their joint use within RdRpCATCH. Certain databases are optimized for efficient scanning or exhibit high sensitivity, and we outline recommendations for their optimal use. RdRpCATCH is distributed as both a conda package and a web server application (https://rdrpcatch.bioinformatics.nl), facilitating access for researchers with diverse expertise. By integrating multiple pHMM resources, this unified framework addresses fragmentation in the field and reduces technical barriers to enable comprehensive viral discovery.
Kim, S. S.; Jackson, J. T.; Zhang, H. B.; Kim, M.
3D genome mapping technologies ChIA-PET, HiChIP, PLAC-seq, HiCAR, and ChIATAC yield pairwise contacts and a one-dimensional signal indicating protein binding or chromatin accessibility. However, a lack of computational tools to quantify the reproducibility of these enrichment-based 3C data prevents rigorous data quality assessment and interpretation. We developed HiChIA-Rep, an algorithm incorporating both 1D and 2D signals to measure similarity via graph signal processing methods. HiChIA-Rep can distinguish biological replicates from non-replicates, cell lines, and protein factors, outperforming tools designed for Hi-C data. With large numbers of multi-ome datasets being generated, HiChIA-Rep will likely be a fundamental tool for the 3D genomics community.
Cheng, Y.; Kettlewell, T.; Laidlaw, R. F.; Hardy, O. M.; McCluskey, A.; Otto, T. D.; Somma, D.
Accurate identification of differentially expressed genes (DEGs) in single-cell RNA sequencing (scRNA-seq) data remains challenging. Single-cell-specific statistical models often report large numbers of candidate genes but can exhibit inflated false positive rates, whereas pseudobulk approaches improve false discovery control at the cost of reduced sensitivity. To overcome the noise and bias of existing tools, and to give the user more control over the DEG process, we present CellDEEP, which uses a cell aggregation (metacell) approach. This tool provides a framework for flexible selection of pooling strategies and parameterisation for differential expression (DE) analysis. Benchmarking on simulated and real datasets, including COVID-19 and rheumatoid arthritis, shows that CellDEEP often outperforms other methods, consistently reduces false positives compared to single-cell methods and recovers more true positives than pseudobulk methods. Our work shifts the focus from selecting a single "best" method to an approach that reduces cell-level noise while preserving biological signal, together with a transparent validation framework, advancing more reliable differential-expression analysis in single-cell transcriptomics.
Magalhaes, H.; Weber, J.; Klau, G. W.; Marschall, T.; Prodanov, T.
Variation of sequence copy number (CN) between individuals can be associated with phenotypical differences. Consequently, CN calling is an important step for disease association and identification, as well as for genome assembly validation. Traditionally, CN calling is done by mapping sequencing reads to a linear reference genome and estimating the CN from the observed read depth. This approach, however, is significantly hampered by sequences and rearrangements not present in a linear reference genome; at the same time simple CN prediction for individual graph nodes does not make use of the graph topology and can lead to inconsistent results. To address these issues, we propose Floco, a method for CN calling with respect to a genome graph using a network flow formulation. Given a graph and alignments against that graph, we calculate raw CN probabilities for every graph node based on the Negative Binomial distribution and the base pair coverage across the node, and then use integer linear programming to compute the CN flow through the whole graph. We tested this approach on 15 aligned datasets, involving three different graphs, as well as HiFi and ONT sequencing reads and linear assemblies split into reads. These results demonstrate that the addition of the network flow formulation increases the accuracy of CN predictions by up to 43% when compared with read depth based estimation alone. Additionally, we observed that concordance between predictions from the three different sequence sources was able to reach 93.2%. Floco fills a gap in CN calling tools specifically designed for genome graphs.
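The per-node step this abstract describes, raw copy-number likelihoods from a Negative Binomial on base-pair coverage, can be sketched as follows. The variance model var = mu + alpha*mu^2 and the alpha value are assumptions; Floco's actual parameterization may differ, and the integer-linear-programming flow stage is omitted entirely.

```python
from math import lgamma, log

def nb_logpmf(k, n, p):
    """log PMF of the Negative Binomial in the (n, p) convention:
    Gamma(k+n) / (Gamma(n) k!) * p^n * (1-p)^k."""
    return (lgamma(k + n) - lgamma(n) - lgamma(k + 1)
            + n * log(p) + k * log(1.0 - p))

def cn_log_likelihoods(coverage, hap_cov, max_cn=8, alpha=0.1):
    """Log-likelihood of each copy number 0..max_cn for one graph node,
    under the variance model var = mu + alpha * mu^2 (alpha is a guess)."""
    n = 1.0 / alpha                       # shape implied by that variance model
    lls = []
    for cn in range(max_cn + 1):
        mu = max(cn * hap_cov, 1e-3)      # avoid a degenerate mean at CN=0
        p = n / (n + mu)
        lls.append(nb_logpmf(coverage, n, p))
    return lls

lls = cn_log_likelihoods(coverage=60, hap_cov=30)   # ~2x haploid depth
best_cn = max(range(len(lls)), key=lambda c: lls[c])
```

In the full method these per-node probabilities are not taken at face value: the flow formulation forces the chosen copy numbers to be consistent across the graph topology, which is where the reported accuracy gain comes from.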
Trussart, M.; Foroutan, M.; Milton, M.; Beltran, H.; Speed, T. P.; Molania, R.
Unwanted variation refers to any source of variability in the data that can compromise downstream analysis. Effective removal of such variation from gene expression data is essential to derive accurate and meaningful biological results. We refer to this process as normalization. Data may come from a single study or from multiple studies with different sources of unwanted variation. We have previously developed the RUV-III method for normalizing omics data with a strong focus on transcriptomics. Initially, we introduced RUV-III for the normalization of Nanostring nCounter gene expression data, utilizing genuine technical replicates and pseudo-replicates as control samples. Subsequently, we proposed RUV-III with pseudo-replicates of pseudo-samples (PRPS), which demonstrated its potential in mitigating the effects of different sources of unwanted variation in large and complex RNA-seq studies. To enhance accessibility and performance of this method, we present a new comprehensive R package named RUVprps. The package offers over 100 functions including ones for assessing variation in both biological and unwanted variables, an automated RUV-III normalization process, and metrics for evaluating the effectiveness of the resulting normalizations. Further, it introduces several new features such as ways of identifying unknown sources of unwanted variation, strategies to identify suitable negative control genes, and methods for generating PRPS when information on the biological and unwanted variation is unavailable. The package also implements a faster approach to RUV-III normalization, streamlining its application to large RNA-seq datasets. Our freely available R package and normalization assessment pipeline can help find effective data normalization methods for new data and help benchmark new methods.
Wolfram-Schauerte, M.; Trust, C.; Waffenschmidt, N.; Nieselt, K.
Time-resolved transcriptomic profiling has been used to study phage-host interactions for more than a decade. However, the resulting datasets are not readily accessible for custom re-analysis, and resources are lacking that provide standardized processing, storage, and analysis of transcriptomes from phage infections. Here, we present the PhageExpressionAtlas, the first bioinformatics resource for storing time-resolved dual RNA-sequencing data from phage infections. This data was processed uniformly using a custom analysis pipeline and is presented for interactive exploration through visualisation. The PhageExpressionAtlas currently hosts 42 datasets from 23 studies. Using the PhageExpressionAtlas, we replicate key findings from original publications and extend hypothesis testing across multiple phage-host systems. By systematically querying and analyzing the underlying database, we evaluate approaches to phage gene classification and show that uncharacterized phage genes are expressed across all infection phases. Moreover, we provide a comprehensive view of the expression dynamics of anti-phage defenses as well as host- and phage-encoded anti-defense systems in the infection context, indicating unique and conserved patterns of transcriptional regulation underlying bacterial anti-phage immunity and phage counter-strategies. Together, the PhageExpressionAtlas is a unifying resource that democratizes transcriptomics-driven analyses of phage-host interactions and supports integrative cross-study assessment.
Forcier, T.; Cheng, E.; Tam, O. H.; Wunderlich, C.; Castilla-Vallmanya, L.; Jones, J. L.; Quaegebeur, A.; Barker, R. A.; Jakobsson, J.; Gale Hammell, M.
Transposable elements (TEs) are mobile genetic sequences that can generate new copies of themselves via insertional mutations. These viral-like sequences comprise nearly half the human genome and are present in most genome wide sequencing assays. While only a small fraction of genomic TEs have retained their ability to transpose, TE sequences are often transcribed from their own promoters or as part of larger gene transcripts. Accurately assessing TE expression from each individual genomic TE locus remains an open problem in the field, due to the highly repetitive nature of these multi-copy sequences. These issues are compounded in single-cell and single-nucleus transcriptome experiments, where additional complications arise due to sparse read coverage and unprocessed mRNA introns. Here we present our tool for single-cell TE and gene expression analysis, TEsingle. Using synthetic datasets, we show the problems that arise when not properly accounting for intron retention events, failing to address uncertainty in alignment scoring, and failing to make use of unique molecular identifiers for transcript resolution. Addressing these challenges has enabled an accurate TE analysis suite that simultaneously tracks gene expression as well as locus-specific resolution of expressed TEs. We showcase the performance of TEsingle using single-nucleus profiles from substantia nigra (SN) tissues of Parkinson's Disease (PD) patients. We find examples of young and intact TEs that mark dopaminergic neurons (DA) as well as many young TEs from the LINE and ERV families that are elevated in PD neurons and glia. These results demonstrate that TE expression is highly cell-type and cellular-state specific and elevated in particular subsets of neurons, astrocytes, and microglia from PD patients.
Patel, H.; Crosslin, D.; Jarvik, G. P.; Hall, T.; Veenstra, D.; Xie, S.
The lack of user-centered design principles in the current landscape of commonly-used bioinformatics software tools poses challenges for novice genomics researchers (NGRs) entering the genomics ecosystem. Comparing the usability of one analysis software to that of another is a non-trivial task and requires evaluation criteria that incorporate perspectives from both existing literature and a diverse, underrepresented user base of NGRs. To better characterize these barriers, we utilized a two-pronged approach consisting of a literature review of existing bioinformatics tools and semi-structured interviews of the needs of NGRs. From both knowledge sources, the key attributes that resulted in poor adoption and sustained use of most bioinformatics tools included poor documentation, lack of readily-accessible informational content, challenges with installation and dependency coordination, and inconsistent error messages/progress indicators. Combining the findings from the literature review and the insights gained by interviewing the NGRs, an evaluation rubric was created that can be utilized to grade existing and future bioinformatics tools. This rubric acts as a summary of key components needed for software tools to cater to the diverse needs of both NGRs and experienced users. Due to the rapidly evolving nature of genomics research, it becomes increasingly important to critically evaluate existing tools and develop new ones that will help build a strong foundation for future exploration.
Abbasi, M.; Ochoa Zermeno, S.; Spendlove, M. D.; Tashi, Z.; Plaisier, C. L.; Bartelle, B. B.
Interpretable representations of gene expression are used to define cellular identities and the molecular programs active within cells, two related, but distinct phenomena. In the case of microglia, a cell type with high transcriptomic, functional, and morphological heterogeneity, the predominant representation of transcriptomic data presumes the adoption of distinct molecular identities, despite a lack of easily separable transcriptional states. Here, we explore alternative transcriptomic representations by comparing two single-cell analysis methods: differential expression analysis for identities and co-expression network analysis for molecular programs. For microglia, co-expression network analysis identifies highly significant functional ontologies not resolved by differential expression analysis. The identified co-expression modules are preserved across transcriptomic datasets and suggest reducible functional programs that activate and modulate depending on context. We conclude that co-expression analysis constitutes a best practice for single cell analysis of an individual cell type and describing microglia function as concurrent molecular programs offers a more parsimonious model of microglia function.
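A bare-bones version of the co-expression analysis contrasted here is to connect genes whose expression profiles correlate strongly and take connected components as modules. The correlation threshold and toy data below are illustrative; the paper's network method is more involved than this sketch.

```python
import numpy as np

def coexpression_modules(expr, genes, min_corr=0.8):
    """Co-expression modules as connected components of the graph linking
    gene pairs with |Pearson r| >= min_corr. expr: samples x genes."""
    r = np.corrcoef(expr.T)               # genes x genes correlation matrix
    adj = np.abs(r) >= min_corr
    n = len(genes)
    seen, modules = set(), []
    for i in range(n):
        if i in seen:
            continue
        stack, comp = [i], []
        while stack:                      # depth-first component search
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(genes[v])
            stack.extend(u for u in range(n) if adj[v, u] and u not in seen)
        modules.append(sorted(comp))
    return modules

# genes "a" and "b" co-vary across four samples; "c" does not track them
expr = np.array([[1, 2, 5], [2, 4, 5], [3, 6, 1], [4, 8, 9]], dtype=float)
modules = coexpression_modules(expr, ["a", "b", "c"])
```

Unlike differential expression, which asks how each gene separates groups of cells, this groups genes that rise and fall together, which is why it can surface concurrent molecular programs within one cell type.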
Shydlouskaya, V.; Haeryfar, S. M. M.; Andrews, T. S.
Single-cell RNA sequencing (scRNA-seq) has enabled large-scale characterization of cellular heterogeneity; yet, integrating datasets generated through different library preparation protocols remains challenging. For instance, comparisons between 10X Genomics 3' and 5' chemistries are complicated by protocol-dependent technical biases imposed by differences in transcript end capture and amplification. While normalization, and often batch correction, is an integral step in preprocessing scRNA-seq datasets, it remains unclear which correction is most appropriate, or even necessary, for reliable cross-protocol comparisons. Here, we systematically characterize protocol-related expression differences using 35 matched donors across six tissues profiled with both 3' and 5' scRNA-seq approaches. We find that gene expression discrepancies are not pervasive across the whole transcriptome, but driven instead by a relatively small, reproducible subset of protocol-biased genes. Excluding these genes improves cross-protocol concordance, indicating that most genes are directly comparable without aggressive correction. We then benchmark commonly employed normalization approaches and show that while several methods, such as fastMNN, improve statistical alignment when cell populations are well matched, they can distort gene-level signals and inflate differential expression in biologically realistic settings with incomplete cell-type overlap. Taken together, our results demonstrate that protocol bias between 3' and 5' scRNA-seq is limited in scope and that targeted handling of a small set of biased genes presents an alternative approach to normalization or batch correction strategies. This work provides a practical guideline for integrating 3' and 5' scRNA-seq data and highlights the importance of matching normalization strategies to the structure of technical variation and the intended downstream analyses.
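Flagging the protocol-biased genes this abstract describes might be done by requiring a consistent expression difference across matched donors. The fold-change threshold and consistency criterion below are invented for illustration and are not the authors' definition.

```python
import numpy as np

def protocol_biased_genes(expr3, expr5, genes, min_lfc=1.0):
    """Flag genes whose expression differs consistently between matched
    3' and 5' profiles. expr3/expr5: donors x genes, same row order."""
    lfc = np.log2(expr3.mean(axis=0) + 1) - np.log2(expr5.mean(axis=0) + 1)
    per_donor = np.log2(expr3 + 1) - np.log2(expr5 + 1)
    # require the same direction of difference in every matched donor
    consistent = (np.sign(per_donor) == np.sign(lfc)).all(axis=0)
    return [g for g, l, c in zip(genes, lfc, consistent)
            if abs(l) >= min_lfc and c]

expr3 = np.array([[8.0, 4.0], [8.0, 4.0]])   # two donors, two genes
expr5 = np.array([[2.0, 4.0], [2.0, 4.0]])
biased = protocol_biased_genes(expr3, expr5, ["geneA", "geneB"])
```

The paper's point is that excluding such a small flagged set can substitute for aggressive batch correction when cell-type overlap between datasets is incomplete.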
Seluzicki, A.; Lee, T. A.; Hartwick, N. T.; Michael, T. P.; Ecker, J. R.; Chory, J.
Transcriptome analysis via RNA sequencing (RNAseq) has become a ubiquitous method of molecular characterization from whole organisms, dissected tissues, and single cells. These experiments have provided an extraordinary volume of data describing the molecular states and responses to many conditions. However, standard approaches to RNAseq analysis commonly use expression level filters that eliminate potentially useful data in the service of decreasing noise. Here we describe the implementation of a coefficient of variation-based filter for RNAseq gene expression data. This filter prioritizes consistent data across replicates, allowing lowly-expressed genes with low-variation measurements to be retained for downstream analysis. We show that, in our Arabidopsis RNAseq data set, this filter allows for the inclusion of many more transcription factors than even a low-stringency expression level filter. We find that these lowly-expressed genes mark specific cell clusters in our single-nucleus (sn)RNAseq dataset. We further characterize communities of co-expressed genes, sampled across the day at two growth temperatures, in relation to snRNAseq cell clusters, finding evidence for a highly photosynthetic cell population, and a cell state marked by high cell division and translation. These methods can be expanded to RNAseq analysis in many systems, facilitating the construction of more detailed models of tissue-specific gene regulatory networks.
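A coefficient-of-variation filter of the kind described can be written in a few lines: keep genes whose sd/mean across replicates is small, regardless of absolute expression level. The threshold is illustrative, not the paper's.

```python
import numpy as np

def cv_filter(counts, max_cv=0.5):
    """Keep genes whose coefficient of variation (sd / mean) across
    replicates is below max_cv, regardless of expression level.
    counts: replicates x genes."""
    mean = counts.mean(axis=0)
    sd = counts.std(axis=0, ddof=1)
    # guard against zero means before dividing
    cv = np.where(mean > 0, sd / np.where(mean > 0, mean, 1.0), np.inf)
    return cv < max_cv

counts = np.array([[10.0, 1000.0, 2.0],
                   [12.0,  100.0, 2.0],
                   [11.0,   10.0, 2.0]])
keep = cv_filter(counts)   # consistent low expressor passes, noisy gene fails
```

Note how the third gene, expressed at a level most expression filters would discard, survives because its replicates agree, which is exactly how low-abundance transcription factors are retained.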
Sabogal-Rodriguez, D.; Caro-Quintero, A.
Metagenome-assembled genomes (MAGs) are routinely recovered from metagenomic studies, yet the population genetic information embedded within these datasets remains largely underutilized. Analyzing within-species genetic variation can reveal adaptive evolution, selection pressures, and ecological dynamics that are hidden when MAGs are treated as homogeneous entities. Existing tools address individual analysis steps in isolation, requiring manual integration and creating barriers for researchers without extensive bioinformatics expertise. Here we present PopMAG, a Nextflow pipeline and interactive Shiny application that automates population genetics analysis of MAGs. PopMAG integrates quality control, community profiling, competitive read mapping, functional annotation, and microdiversity estimation into a single reproducible workflow. The pipeline calculates key population genetics metrics including nucleotide diversity (π), pN/pS ratios, fixation index (FST), Levins index, and SNV counts, with results consolidated into an interactive visualization platform for metadata-driven exploration. We demonstrate PopMAG's utility through analysis of longitudinal cystic fibrosis lung metagenomes, where we detect signatures of antibiotic-driven selection in Pseudomonas aeruginosa efflux pump genes coinciding with treatment intervention. Availability and implementation: PopMAG and corresponding documentation are publicly available at https://github.com/daasabogalro/PopMAG.
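Nucleotide diversity (π), one of the metrics listed above, is the mean number of pairwise differences per aligned site within a population. A minimal computation over aligned sequences (an illustration, not PopMAG's implementation, which works from mapped reads rather than full sequences):

```python
from itertools import combinations

def nucleotide_diversity(seqs):
    """Nucleotide diversity (pi): mean pairwise differences per aligned site."""
    if len(seqs) < 2:
        return 0.0
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    # total mismatching sites summed over all sequence pairs
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / (len(pairs) * length)

# 2 mismatching sites over 3 pairs x 4 sites
pi = nucleotide_diversity(["ACGT", "ACGA", "ACGA"])
```

High π within a MAG indicates a heterogeneous population, precisely the signal lost when a MAG is treated as a single consensus genome.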